COMBINED SCHEDULING AND MAPPING OF DIGITAL SIGNAL 
PROCESSING ALGORITHMS ON A VLIW PROCESSOR 

Reference to Related Application 

The present patent application claims priority benefit of U.S. Provisional 
Application No. 60/240,151, filed October 13, 2000, titled "COMBINED 
SCHEDULING AND MAPPING OF DIGITAL SIGNAL PROCESSING 
ALGORITHMS ON VLIW DSPS," the content of which is hereby incorporated by 
reference in its entirety. 

Field of the Invention 

This invention relates to the optimization of signal processing programs, and 
more particularly, to a process for the combined scheduling and mapping of fully 
deterministic digital signal processing algorithms on a processor. 

Description of the Related Art 

Computational efficiency is critical to the effective execution of Digital Signal 
Processing (DSP) applications. Real-time DSP applications usually require processing 
large quantities of data in a short period of time. The DSP algorithms that comprise the 
DSP applications can be continuous and repetitive in nature, where operations are 
repeated in an iterative manner as samples are processed, and often possess a high 
degree of parallelism, where several separate operations can be executed concurrently. 

Because digital signal processing algorithms often possess a high degree of 
parallelism, multiple processors may work in parallel to perform the computations. 
Consequently, DSP applications are implemented on DSP hardware systems having 
multiple Functional Units (FUs) capable of processing data simultaneously. Such 
hardware systems comprise processors with FUs on a single chip architecture, referred 
to as Very Long Instruction Word (VLIW) architecture; where one long instruction 
word specifies the instructions to be performed by each of the FUs in a machine cycle. 
The TMS320C6xx/TMS320C64x ('C6xx) family of DSPs from Texas Instruments® 
provides one example of a DSP processor with multiple functional units utilizing a 
VLIW architecture. The StarCore SC140 by Motorola is another such example. 

To optimize the execution of DSP applications, the DSP algorithms should be 
implemented in a manner that exploits the processor architecture by utilizing 
instruction-level parallelism. Developing this parallelism, however, is a tedious task. 
Conventionally, a compiler is used to detect parallel operations in a program and 
automatically map them onto the processor architecture. While effective in some cases, 
compiled code often does not utilize the full parallelism of the processor architecture. 

As an example, the 'C6xx DSP uses a RISC-like instruction set to aid the 
compiler with dependency checking. The compiler detects parallel operations in a 
program and attempts to schedule the instructions for optimal performance. In some 
special cases, the compiler is effective in producing parallel code. Nevertheless, code 
for complex algorithms, written in hand-coded assembly language, often outperforms 
compiler-generated code by a factor of 10-40. Writing parallel assembly language code 
by hand is a tedious and time consuming task, typically requiring many revisions of the 
code in order to detect and schedule the parallelism present in the algorithm. 

To improve the efficiency of mapping and scheduling, while minimizing the 
effort required, various techniques, particularly compiler-based solutions, have been 
proposed. None of these techniques, however, optimally utilizes instruction-level 
parallelism. There is therefore a need for an improved method and system to schedule 
and map the operations of a DSP algorithm onto a parallel computing system. 

Summary of the Invention 

The present invention addresses these and other problems by providing a method 
for scheduling computation operations on a very long instruction word processor so as 
to have a substantially optimal iteration period for a cyclic algorithm. 

One embodiment uses a flow graph wherein each computation operation appears 
as a separate node, and a plurality of edges represents data dependencies between the 
separate nodes. The scheduling and mapping problem is modeled on the basis of the 
DSP algorithm, and the processor architecture. The flow graph is transformed into 
machine-readable data for use in an integer linear program. The machine-readable data 
expresses equations and constraints associated with the optimal iteration period of the 
algorithm implemented on a processor having a plurality of types of functional units. 
The equations and constraints comprise an objective function to be minimized, a set of 
operation precedence constraints, job completion constraints, iteration period constraints 
and functional unit constraints. The nature of the equations and constraints is modified 
based upon processor architecture. The minimum iteration period for completion of the 
computation operations, and the scheduling of nodal operations, is determined by 
computing an optimal solution to the integer linear program as a solution of its 
corresponding linear constraints. The computation operations are scheduled and 
mapped according to the optimal solution provided by the integer linear program. 

Brief Description of the Drawings 

These and other features and advantages of the present invention will be 
appreciated, as they become better understood by reference to the following Detailed 
Description when considered in connection with the accompanying drawings, wherein: 

FIG. 1 depicts a Fully Specified Flow Graph (FSFG) of a 2nd-order Infinite 
Impulse Response (IIR) filter; 

FIG. 2 is a block diagram of the functional units of the 'C6xx DSP; 

FIG. 3 depicts a FSFG of a 2nd-order IIR filter with memory access; and 

FIG. 4 is a block diagram of the data path of a StarCore processor. 

Detailed Description of the Invention 

The present invention is a method and system for mapping and scheduling 
algorithms on parallel processing units. The present invention will presently be 
described with reference to the aforementioned drawings. Where arrows are utilized in 
the drawings, it would be appreciated by one of ordinary skill in the art that the arrows 
represent the interconnection of elements and/or the communication of data between 
elements. 

Defining the signal processing algorithm by using a fully specified flow graph 
(FSFG) decreases the development time of signal processing algorithms. A FSFG is 
defined by the 3-tuple (N, E, D), where N is a set of nodes that represent the atomic 
operations performed on the data, E is a set of directed edges that represent the flow of 
data between different operations, and D is a set of ideal delays. 

The parameters characterizing an FSFG mapped onto multiple functional units 
include the following: 

N        the set of nodes 
E        the set of directed edges 
D        the set of ideal delays 
P_i/o    a set of paths from input node to output node 
t_i      a time that node i ∈ N completes its execution 
τ        iteration period (time after which next iteration can be started) 
d_i      execution time of node i ∈ N 
n_vw     a number of ideal delays on edge e(v, w) ∈ E from node v to node w, 
         where (v, w ∈ N) 
D_i/o    a throughput delay 
P_r      a number of processors of type r in the VLIW 
r        a type of processor ∈ {adder, multiplier, register, etc.} 

Other variables can be optionally incorporated into a FSFG, such as cp_jk, a 
communication path between functional units j and k; c_jk, a communication cost for 
communication path cp_jk; and u_jk, a maximum number of communications on 
communication path cp_jk at any one instant. 

FSFG graphs are normally cyclic, with data dependencies between iterations. 
The computational latency of node i is given by d_i, and represents the time at which 
node i completes its execution. The nodes in the FSFG are atomic operations that are 
indivisible and depend on the computational capacity of the functional units. Atomic 
operations represent the smallest granularity of achievable parallelism. 

The FSFG of a 2nd-order IIR filter is shown in FIG. 1. The input 150 is shown as 
signal x[n], and the output 151 is shown by the signal y[n]. Nodes n_1 101, n_2 102, n_7 
107, and n_8 108 perform addition operations, while nodes n_3 103, n_4 104, n_5 105, and n_6 
106 perform multiply operations. 

The edges of the graph represent data dependencies between the nodes. Where 
more than one operation depends on the output of a node, each dependency is 
represented as a separate edge. The separate edges are required for scheduling purposes. 
Node n_8 108 depends from nodes n_2 102 and n_7 107, and the dependencies are 
represented by edges e_2 122 and e_11 131, respectively. Nodes n_3 103, n_4 104, n_5 105, and 
n_6 106 also depend from node n_2 102, and the dependencies are represented by edges e_5 
125, e_6 126, e_7 127, and e_8 128, respectively. Edges e_6 126 and e_8 128 represent 
dependencies from node n_2 102 but with a delay, and edges e_5 125 and e_7 127 represent 
dependencies from node n_2 102 with two delays. Edges e_1 121, e_3 123, and e_9 129 
represent dependencies from nodes n_1 101, n_3 103, and n_5 105 to nodes n_2 102, n_1 101, 
and n_7 107 respectively. Input signals a_0, a_1, b_0 and b_1 [collectively not shown] represent 
the coefficients of the IIR filter and are inputted into n_4 104, n_3 103, n_6 106, and n_5 105 
respectively. 
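
As an illustration only, the FSFG of FIG. 1 might be captured in software along the 
following lines. This is a minimal sketch: the `Edge` class and the 
`intra_iteration_order` helper are hypothetical names, only the edges spelled out in the 
text above are listed, and edges for which no delay is mentioned are assumed to carry 
zero delays. 

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Edge:
    src: str     # node producing the value
    dst: str     # node consuming the value
    delays: int  # number of ideal delays on the edge


# Only edges described in the text are listed; e4 and e10 are not described
# there, so they are left out rather than guessed. Zero delays are assumed
# wherever the text does not mention a delay.
EDGES = {
    "e1": Edge("n1", "n2", 0),
    "e2": Edge("n2", "n8", 0),
    "e3": Edge("n3", "n1", 0),
    "e5": Edge("n2", "n3", 2),
    "e6": Edge("n2", "n4", 1),
    "e7": Edge("n2", "n5", 2),
    "e8": Edge("n2", "n6", 1),
    "e9": Edge("n5", "n7", 0),
    "e11": Edge("n7", "n8", 0),
}


def intra_iteration_order(edges):
    """Topologically sort the nodes using only the zero-delay edges.

    Edges with one or more delays link different iterations, so they do not
    constrain the order of operations inside a single iteration.
    """
    preds = {}
    for e in edges.values():
        preds.setdefault(e.src, set())
        preds.setdefault(e.dst, set())
        if e.delays == 0:
            preds[e.dst].add(e.src)
    return list(TopologicalSorter(preds).static_order())


order = intra_iteration_order(EDGES)
```

The delayed edges are what make the graph cyclic; once they are set aside, the remaining 
zero-delay edges form an acyclic graph that admits an ordering within one iteration. 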

The FSFG is also useful to define the parameters and constraints for a Mixed 
Integer Program (MIP). A mixed integer programming approach for optimally 
scheduling and mapping algorithms onto a processor eases the process of hand 
coding. Mixed Integer Programming is similar to Linear Programming (LP), where a 
system is modeled using a series of linear equations. Each equation represents a 
constraint on the system. In addition to the constraints, there is an objective function, 
where the goal is to minimize (or sometimes maximize) the result. 

Mixed Integer Programming is useful when the feasible solutions have to be the 
equivalent of whole numbers or a binary decision. For example, assuming it is not 
feasible to schedule 1.2438 multiplication operations in a clock cycle, then the optimum 
number of multiplication operations must be 1 or 2. Simply rounding off values does 
not guarantee correct results; instead, Integer Programming must be used. 
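
The point about rounding can be seen on a toy problem that is unrelated to any 
particular DSP. The sketch below is illustrative only: the two constraints and the 
objective are invented for the example, and the LP-relaxation optimum (3, 1.5) was 
worked out by hand from the intersection of the two constraint lines rather than 
computed by a solver. 

```python
from itertools import product


def feasible(x, y):
    """Two invented linear constraints for the toy problem."""
    return 6 * x + 4 * y <= 24 and x + 2 * y <= 6


def value(x, y):
    """Invented objective to be maximized."""
    return 5 * x + 4 * y


# LP relaxation optimum, derived by hand: x = 3, y = 1.5, value 21.
rounded_up = (3, 2)    # rounding y up violates both constraints
rounded_down = (3, 1)  # rounding y down is feasible but not optimal

# Exhaustive search over the small integer grid gives the true integer optimum.
best = max(
    (value(x, y), x, y)
    for x, y in product(range(7), repeat=2)
    if feasible(x, y)
)
```

Here the integer optimum lies at (4, 0), which is not reachable by rounding either 
coordinate of the relaxed solution. 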

The inherent constraints of the DSP and the scheduling requirements of the 
FSFG provide a starting point for writing an efficient signal-processing algorithm. 
Through trial and error, a programmer may eventually create an optimal algorithm. 
Through the use of Integer Linear Programming (ILP) techniques to automate this long 
and difficult task, a programmer can greatly reduce development time. With ILP, the 
incorporated variables are limited to integer values, while with MIP a portion of the 
variables can have integer values and a portion of the variables can have real values. 

The scheduling of parallel instructions is driven largely by the architecture of the 
DSP. A simplified data path of the 'C6xx DSP is shown in FIG. 2. The 'C6xx has 
eight functional units divided into two groups, each group having four functional unit 
types, labeled .L1 210, .S1 220, .M1 230, and .D1 240, and .L2 260, .S2 270, .M2 
280, and .D2 290. Each of the four unit types can perform different specialized 
operations, such as arithmetic operations, byte shift operations, multiplication or 
compare operations, and address generation. Each group of four functional units is also 
associated with a register file 200, 250 containing 16 32-bit registers each. Each 
functional unit reads directly from and writes directly to the register file within its own 
group. Additionally, the two register files are connected to the functional units of the 
opposite side via unidirectional cross paths 202, 252. The 3 FUs on one side can 
access only one operand from the other side at a time. Both sides work independently. 
The only cross communication is via the cross paths, and these cannot be used to store a 
result on the register file of the other side. The 'C6xx also includes a control register 
204 for handling memory access. 

The multiple functional units of the 'C6xx DSP are controlled by the several 
basic instructions found in a single long instruction word. By carefully scheduling the 
parallel execution of independent basic instructions, a programmer can efficiently 
implement signal processing algorithms. 

The code for a 'C6xx DSP must provide for the transfer of data from memory or 
registers between the two groups of functional units using the cross paths 202, 252. The 
two groups of functional units are connected by their register files 200, 250, so all 
communications between them must go through the registers. This requires modifying 
the FSFG to include storage of results into the registers as a node. 

FIG. 3 shows a new FSFG of the 2nd-order IIR filter with memory nodes at the 
output of every original node. Edges e_1 321, e_3 323, e_7 327, e_8 328, e_13 333, e_14 334, and 
e_17 337 provide data for memory nodes n_9 309, n_10 310, n_11 311, n_12 312, n_13 313, n_14 314, 
and n_15 315, respectively. Edges e_1 321, e_3 323, e_7 327, e_8 328, e_13 333, e_14 334, and e_17 
337 represent dependencies from nodes n_1 101, n_2 102, n_3 103, n_4 104, n_5 105, n_6 106, 
and n_7 107, respectively. 

Node n_8 108 depends from nodes n_10 310 and n_15 315, and the dependencies are 
represented by edges e_6 326 and e_18 338, respectively. Nodes n_3 103, n_4 104, n_5 105, and 
n_6 106 also depend from node n_10 310, and the dependencies are represented by edges e_9 
329, e_10 330, e_11 331, and e_12 332, respectively. Edges e_10 330 and e_12 332 represent 
dependencies from node n_10 310 but with a delay, and edges e_9 329 and e_11 331 represent 
dependencies from node n_10 310 with two delays. Edges e_2 322, e_4 324, and e_15 335 
represent dependencies from memory nodes n_9 309, n_11 311, and n_13 313 to nodes n_2 
102, n_1 101, and n_7 107 respectively. Input signals a_0 160, a_1 161, b_0 170 and b_1 171 
represent the coefficients of the IIR filter. 

Signal processing algorithms typically run through repeated iterations of a 
computation process. Because of the cyclic nature of signal processing algorithms, 
optimizing the iteration period results in optimization of the entire algorithm. Ideally, 
the iteration period takes a single cycle to complete. This is usually not possible, 
however, because data dependencies prevent performing all the nodes at the same time. 
Additionally, the number of functional units on the 'C6xx DSP is limited, so a single 
iteration period may take several VLIW cycles to complete. 

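Minimization of the Iteration Period (τ) and the periodic throughput delay D_i/o 
provides the optimal schedule when given limited processing resources. The iteration 
period can be expressed by the equation 

$$\tau = \sum_{j=b_l}^{b_u} j\,\tau_j$$

where each τ_j is a 0-1 variable and b_l and b_u are the lower and upper bounds on the 
iteration period. Within these bounds, only a single iteration period can be deemed valid 
and true, namely have the value of 1. 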

The throughput delay D_i/o is given by the expression 

$$D_{i/o} = \sum_{p=1}^{P_r}\sum_{t=1}^{T} t\,x_{(\mathrm{output})pt} - \sum_{p=1}^{P_r}\sum_{t=1}^{T} t\,x_{(\mathrm{input})pt}$$

By weighting the iteration period by a factor of T, both the iteration period and 
the throughput delay can be optimized with a single equation. Using T ensures that the 
weighted iteration period is greater than the maximum possible throughput delay. 

Even though the minimum iteration period is not known in advance, the 
programmer can often make a reasonable estimate of the expected value. Setting a 
lower bound b_l and an upper bound b_u for possible iteration time periods reduces the 
computing time required to solve the minimization equation. The objective function is 
to optimize the iteration period and throughput delay by minimizing the expression 

$$T\sum_{j=b_l}^{b_u} j\,\tau_j + \sum_{p=1}^{P_r}\sum_{t=1}^{T} t\,x_{(\mathrm{output})pt} - \sum_{p=1}^{P_r}\sum_{t=1}^{T} t\,x_{(\mathrm{input})pt}$$
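
The effect of the weighting can be checked with a few lines of arithmetic. The sketch 
below is illustrative only: the helper name `objective` and the sample times are 
assumptions, and scheduled times are taken to lie in the range 1 to T so that the 
throughput delay can be at most T - 1. 

```python
def objective(period, t_input, t_output, num_time_classes):
    """Weighted objective: T * (iteration period) + (throughput delay)."""
    throughput_delay = t_output - t_input
    return num_time_classes * period + throughput_delay


T = 8  # number of time classes, as in the 'C6xx example

# Shorter period with the largest possible throughput delay (T - 1 = 7).
short_period_long_delay = objective(2, t_input=1, t_output=8, num_time_classes=T)

# Longer period with no throughput delay at all.
long_period_no_delay = objective(3, t_input=1, t_output=1, num_time_classes=T)
```

Because one extra unit of iteration period costs T while the throughput delay can 
contribute at most T - 1, the shorter iteration period always yields the smaller 
objective value, and the throughput delay only breaks ties between schedules having the 
same period. 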

After specifying the objective function, integer linear programming also requires 
defining the constraints. Inputs to some nodes depend from outputs of other nodes, so 
not all the nodes in the FSFG can be processed in parallel. Constraints are used to define 
nodes that must be processed in sequential order. Given that node v precedes node w, 
the time at which node w is processed must be greater than the time at which node v is 
processed. Further, this difference in time must be greater than the difference between 
the computational throughput delay and the cost of ideal delays for a given iteration 
period. This concept is expressed by the equation 

$$t_w - t_v \ge d_v - n_{vw}\sum_{j=b_l}^{b_u} j\,\tau_j$$

where 

$$t_i = \sum_{t=1}^{T}\sum_{p=1}^{P_r} t\,x_{ipt}$$

This equation does not model the costs associated with memory and registers. 
The functional units can communicate by using the cross paths or store data in memory, 
and these communication costs must be factored into the operation precedence 
constraints. The communication costs are given by the expression 

$$\sum_{t=1}^{T}\sum_{p_2=1}^{P_r} x_{i_2 p_2 t}\sum_{p_1=1}^{P_r} c_{p_2 p_1}\,x_{i_1 p_1 t}$$

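Combining these expressions, the operation precedence constraint for an edge from 
node i_1 to node i_2 is defined by the equation 

$$\sum_{t=1}^{T}\sum_{p_2=1}^{P_r} t\,x_{i_2 p_2 t} - \sum_{t=1}^{T}\sum_{p_1=1}^{P_r} t\,x_{i_1 p_1 t} + n_{i_1 i_2}\sum_{j=b_l}^{b_u} j\,\tau_j - \sum_{t=1}^{T}\sum_{p_2=1}^{P_r} x_{i_2 p_2 t}\sum_{p_1=1}^{P_r} c_{p_2 p_1}\,x_{i_1 p_1 t} \ge d_{i_1}$$

The above expression is nonlinear and cannot be solved by existing MIP solvers. 
Therefore the Oral and Kettani transformation is applied to linearize the expression as 
follows: 

Let 

$$y_{i_2 p_2 t} = x_{i_2 p_2 t}\sum_{p_1=1}^{P_r} c_{p_2 p_1}\,x_{i_1 p_1 t}$$

such that 

$$y_{i_2 p_2 t} = \begin{cases} 0 & \text{if } x_{i_2 p_2 t} = 0 \\ \sum_{p_1=1}^{P_r} c_{p_2 p_1}\,x_{i_1 p_1 t} & \text{if } x_{i_2 p_2 t} = 1 \end{cases}$$

Replace the nonlinear y_{i_2 p_2 t} with a linear expression 

$$\sum_{p_1=1}^{P_r} c_{p_2 p_1}\,x_{i_1 p_1 t} - b_{p_2}\,(1 - x_{i_2 p_2 t}) + z_{i_2 p_2 t} \qquad \text{where } b_{p_2} = \sum_{p_1} c_{p_2 p_1}$$

then 

$$\sum_{t=1}^{T}\sum_{p_2=1}^{P_r} t\,x_{i_2 p_2 t} - \sum_{t=1}^{T}\sum_{p_1=1}^{P_r} t\,x_{i_1 p_1 t} + n_{i_1 i_2}\sum_{j=b_l}^{b_u} j\,\tau_j - \sum_{t=1}^{T}\sum_{p_2=1}^{P_r}\left(\sum_{p_1=1}^{P_r} c_{p_2 p_1}\,x_{i_1 p_1 t} - b_{p_2}\,(1 - x_{i_2 p_2 t}) + z_{i_2 p_2 t}\right) \ge d_{i_1}$$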
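
The idea behind this kind of linearization can be checked by brute force on a small 
case. The sketch below is only an illustration of the substitution as read here, not a 
statement of the Oral and Kettani method in full: the cost vector `c` is made up, and 
the slack `z` is chosen explicitly as the smallest non-negative value that restores 
equality, whereas in a MIP it would be a decision variable left to the solver. 

```python
from itertools import product


def linearized(c, x1, x2):
    """Evaluate sum(c*x1) - b*(1 - x2) + z with an explicitly chosen slack z."""
    b = sum(c)                                   # b = sum of all costs
    s = sum(ci * xi for ci, xi in zip(c, x1))    # linear part: sum of c * x1
    z = (1 - x2) * (b - s)                       # slack needed only when x2 == 0
    return s - b * (1 - x2) + z, z


c = [0, 1, 1]  # hypothetical communication costs
checks = []
for x2 in (0, 1):
    for x1 in product((0, 1), repeat=len(c)):
        nonlinear = x2 * sum(ci * xi for ci, xi in zip(c, x1))  # the product term
        linear, z = linearized(c, x1, x2)
        checks.append(z >= 0 and linear == nonlinear)

all_match = all(checks)
```

When x2 is 1 the correction term vanishes and the slack is zero; when x2 is 0 the slack 
absorbs whatever remains, so the linear expression reproduces the product for every 0-1 
assignment. 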



All nodes of the FSFG must be scheduled for processing a single time within 
each iteration period. This job completion constraint is shown by the expression 

$$\sum_{t=1}^{T}\sum_{p=1}^{P_r} x_{ipt} = 1 \qquad \text{for all nodes } i = 1, 2, \ldots, N$$

Only one iteration period is selected from the range of iteration periods. This 
iteration period constraint is shown by the expression 

$$\sum_{j=b_l}^{b_u} \tau_j = 1$$

The iteration period is being minimized, so more than one time value can be 
assigned to the iteration period. The functional unit modulo constraint ensures that, at 
most, P_fu processors are used for each time class. There are b_u - b_l + 1 sets of 
iteration period. To model this, each set must be specified to constrain the problem only 
if its iteration period is optimal. 

A Functional Unit of type fu can do the operation of type fu because it represents 
the set of time classes for which an operation remains alive on a FU. 

$$\sum_{i \in N_{fu}}\sum_{p=1}^{P_{fu}}\sum_{s \in S_n} x_{ips} \le P_{fu} + M\,(1 - \tau_{b_l})$$

for i = 1, 2, ..., N, n = 0, 1, ..., b_l - 1, S_n = {s | s mod b_l = n} 

$$\sum_{i \in N_{fu}}\sum_{p=1}^{P_{fu}}\sum_{s \in S_n} x_{ips} \le P_{fu} + M\,(1 - \tau_{b_u})$$

for i = 1, 2, ..., N, n = 0, 1, ..., b_u - 1, S_n = {s | s mod b_u = n} 

M should be greater than P_fu so that an either-or-constraint condition is met. 
N_fu = set of nodes mapped on the FU of type fu. 

The DSP is limited to accessing a single operand for each of the two cross paths. 
This load constraint is shown by the expression 

$$\sum_{i_2}\sum_{p_2=1}\sum_{p_1=1} x_{i_1 p_1 t}\,x_{i_2 p_2 t} \le 1 \qquad \text{for each time class } t = 1, \ldots, T$$

After linearization this quadratic expression becomes 

$$\sum_{i_2}\sum_{p_2=1}\sum_{p_1=1}\left(x_{i_1 p_1 t} - b_{p_2}\,(1 - x_{i_2 p_2 t}) + z_{i_2 p_2 t}\right) \le 1$$

where p_1, p_2 belong to different sides. 

The linearization process adds the following constraints to the MIP 

[summation over p_1 illegible] and z_{i_2 p_2 t} ≥ 0 for all store edges, for all 
t = 1, ..., T, p_2 = 1, ..., P_fu, and 

[summation over p_1 illegible] and z_{i_2 p_1 p_2 t} ≥ 0 for all load edges 

The performance of an operation by the FU p on a node i at time t is represented 
by setting the value of x_ipt to 1. If no operation is performed with those parameters, 
the value is set to 0. This 0-1 constraint is shown by the expression 

$$x_{ipt} = \begin{cases} 1 & \text{node } i \text{ is processed by FU } p \text{ at time } t \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, N, \quad t = 1, 2, \ldots, T$$

N = Number of operation Nodes in the FSFG 
P_fu = Number of FUs of Type fu in the VLIW 
fu ∈ {Adder, Multiplier, Register} etc. 
T = Number of time classes considered. 

The following example shows the results for a 2nd-order IIR filter shown in FIG. 3. 

N = 15 as shown in FSFG of Figure 3. 

P_a = the Number of Adders in the 'C6xx 

P_m = the Number of Multipliers in the 'C6xx 

P_r = the Number of Registers in the 'C6xx 

T = 8 (approximate time to serially process the 8 nodes) 

b_u = 3 the upper bound estimate of the iteration period, which can be arbitrarily 
chosen, provided it is between the maximum number of nodes divided by the number of 
functional units and maximum nodes. 

b_l = 2 the lower bound estimate of the iteration period (8 nodes with 4 functional 
units) 
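
The arithmetic behind these bounds is simple enough to spell out. The following is a 
minimal sketch; the helper name `resource_lower_bound` is an assumption introduced for 
illustration, and the figures of 8 operation nodes and 4 functional units are those 
quoted above. 

```python
import math


def resource_lower_bound(num_nodes, num_units):
    """Fewest cycles in which num_units units can each issue one node per cycle."""
    return math.ceil(num_nodes / num_units)


num_nodes = 8   # operation nodes of the filter
num_units = 4   # functional units available to them

b_l = resource_lower_bound(num_nodes, num_units)  # lower bound estimate
b_u = 3                                           # upper bound chosen in the example

# The chosen upper bound must lie between the resource bound and the node count.
b_u_is_valid = b_l <= b_u <= num_nodes
```

A tighter upper bound shrinks the number of candidate iteration periods, and with it the 
number of functional unit constraint sets handed to the solver. 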

The objective function is given by the expression 

$$\text{Minimize:}\quad 8\sum_{j=2}^{3} j\,\tau_j + \sum_{p=1}^{2}\sum_{t=1}^{8} t\,x_{(\mathrm{output})pt} - \sum_{p=1}^{2}\sum_{t=1}^{8} t\,x_{(\mathrm{input})pt}$$



The precedence constraints are given by the expressions 

[expression illegible in source] 

for load edges {2, 4, 5, 6, 9, 10, 11, 12, 15, 16, 18} 

[expression illegible in source; the surviving terms are sums over t = 1, ..., 8 and 
j = 2, 3, with the left-hand side constrained to be ≥ 0] 

for store edges {1, 3, 7, 8, 13, 14, 17} 

The job completion constraint is given by the expression 

$$\sum_{t=1}^{8}\sum_{p=1}^{P_r} x_{ipt} = 1 \qquad \text{for all nodes } i = 1, 2, \ldots, 15$$

The iteration period constraint is given by the expression 

$$\sum_{j=2}^{3} \tau_j = 1$$

The processor constraints are given by the expressions 

$$\sum_{i \in N_r}\sum_{p=1}^{P_r}\sum_{s \in S_n} x_{ips} \le P_r + (P_r + 1)(1 - \tau_2)$$

for S_0 = {1,3,5,7}, S_1 = {2,4,6,8} 

N_a = {1,2,7,8} additions 
N_m = {3,4,5,6} multiplications 
N_r = {9,10,11,12,13,14} load/store 

$$\sum_{i \in N_r}\sum_{p=1}^{P_r}\sum_{s \in S_n} x_{ips} \le P_r + (P_r + 1)(1 - \tau_3)$$

for S_0 = {1,4,7}, S_1 = {2,5,8}, S_2 = {3,6} 

N_a = {1,2,7,8} additions 
N_m = {3,4,5,6} multiplications 
N_r = {9,10,11,12,13,14} load/store 

The load constraints are given by the expressions 

[expression illegible in source] 

where p_1, p_2 belong to different sides. 

The linearization process adds the following constraints to the MIP 

[summation over p_1 illegible] and z_{i_2 p_2 t} ≥ 0 for all store edges 
{1,3,7,8,13,14,17}, for all FUs and t = 1, 2, ..., 8 

[summation over p_1 illegible] and z_{i_2 p_1 p_2 t} ≥ 0 for edges 
{2,4,5,6,15,16,18}, for all FUs and t = 1, 2, ..., 8 

These equations are representative of equation sets which, when taken 
individually, can be solved using any known commercially available Integer Program 
solver operating on a computer having a central processing unit and memory. One of 
ordinary skill in the art would appreciate that, with the equations given above, equation 
sets can be derived that act as inputs to commercially available IP solvers and that 
result in outputs which detail a combined schedule and map of the algorithm onto the 
processor architecture. 

The results of the process are shown in Table 1. The optimal iteration period is 
calculated to be 3, with the nodes scheduled as shown in Table 1. Time slots T1, T2, 
and T3 represent the three periods and the nodes are listed thereunder. It should be 
noted that node 8 from the previous iteration (the previous iteration is represented by the 
-1 superscript notation) is processed at the same time as nodes 3 and 5 from the 
following iteration. The far left hand column represents the functional units performing 
the iterated functions. Based on this, the DSP algorithm can readily be programmed. 





         T1      T2      T3 
.M1      3^1     4^1 
.M2      5^1     6^1 
.L1              1^1     2^1 
.L2 

Table 1 Combined Schedule for 2nd-Order IIR Filter for C6X 

In a second embodiment, the invention is used to schedule and map a digital 
signal processing algorithm onto a StarCore SC140 VLIW processor. The scheduling 
of parallel instructions is, as aforementioned, directed by the architecture of the DSP. As 
shown in FIG. 4, the simplified data path 400 of the StarCore processor has four FUs 
410 and a 40-bit register file 420, which has sixteen registers [not shown individually]. 
All the FUs 410 are the same, each containing an ALU with a MAC and a bit operation unit. 
Thus, any operation can be assigned to any FU 410. This type of architecture is 
homogeneous and presents fewer scheduling constraints. 

As previously discussed, in the scheduling process the iteration period and the 
periodic throughput delay must be minimized. In this embodiment, however, cross-path 
communication is not an issue, because of a different architecture relative to the 
previously examined processor. As such, the equations and constraints differ from the 
previously discussed exemplary application. 



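$$x_{it} = \begin{cases} 1 & \text{node } i \text{ is scheduled at time } t \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, N, \quad t = 1, 2, \ldots, T$$

N = Number of operation nodes in the FSFG, 
T = Number of time classes considered 

The necessary objective function to be minimized is 

$$T\sum_{j=b_l}^{b_u} j\,\tau_j + \sum_{t=1}^{T} t\,x_{ot} - \sum_{t=1}^{T} t\,x_{it}$$

where o = output node and i = input node 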

Precedence constraints are determined by modeling processor behavior. In this 
case, where node i_1 precedes node i_2, a precedence constraint is established, shown as 

$$\sum_{t=1}^{T} t\,x_{i_2 t} - \sum_{t=1}^{T} t\,x_{i_1 t} - d_{i_1} + n_{i_1 i_2}\sum_{j=b_l}^{b_u} j\,\tau_j \ge 0$$

for all edges e(i_1 -> i_2) ∈ E where node i_1 must be scheduled before node i_2. 

The variables b_l and b_u represent the lower and upper bounds of iteration period τ, and 
n_{i_1 i_2} is the number of ideal delays on edge e(i_1 -> i_2) ∈ E. 

The job completion constraints are set by the requirement that all nodes must be 
scheduled as: 

$$\sum_{t=1}^{T} x_{it} = 1, \qquad \text{for all nodes } i = 1, 2, \ldots, N$$




Since only one iteration period is to be selected out of a range of iteration 
periods, the iteration period equation is: 

$$\sum_{j=b_l}^{b_u} \tau_j = 1$$



As previously noted, the processor being used has 4 identical FUs. Therefore, at 
any given point in time, each of the FUs can be concurrently scheduled. 

$$\sum_{i \in N}\sum_{s \in S_n} x_{is} \le 4 + M\,(1 - \tau_{b_u})$$

for i = 1, 2, ..., N, n = 0, 1, ..., b_u - 1, S_n = {s | s mod b_u = n} 

M should be greater than 4 so that either-or-constraint condition is met. 

N = set of nodes mapped on the FU. 

x_it ∈ {0,1} for all i = 1, 2, ..., N and t = 1, 2, ..., T 
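
To make the interplay of these constraints concrete, the sketch below searches a tiny 
made-up graph exhaustively instead of calling an Integer Program solver. Everything in 
it is an assumption for illustration: the function name `find_schedule`, the four-node 
graph, the two functional units, and the simplification that every node occupies a 
functional unit for exactly one time class. It returns the first feasible schedule 
having the smallest iteration period and does not additionally minimize the throughput 
delay. 

```python
from itertools import product


def find_schedule(nodes, edges, exec_time, num_fus, b_l, b_u, horizon):
    """Return (period, start_times) for the smallest feasible period, else None.

    edges holds (i1, i2, delays) triples; start times range over 1..horizon.
    """
    for period in range(b_l, b_u + 1):  # candidate iteration periods, ascending
        for starts in product(range(1, horizon + 1), repeat=len(nodes)):
            t = dict(zip(nodes, starts))
            # Precedence: t[i2] - t[i1] - d[i1] + delays * period >= 0
            if any(t[i2] - t[i1] - exec_time[i1] + n * period < 0
                   for i1, i2, n in edges):
                continue
            # Modulo resource limit: at most num_fus nodes per time class
            load = [0] * period
            for s in starts:
                load[s % period] += 1
            if max(load) <= num_fus:
                return period, t
    return None


# Hypothetical graph: a three-node loop carrying one delay, plus one extra node.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0), ("b", "c", 0), ("c", "a", 1), ("a", "d", 0)]
exec_time = {n: 1 for n in nodes}

result = find_schedule(nodes, edges, exec_time, num_fus=2, b_l=2, b_u=4, horizon=4)
```

In this toy case the resource bound alone would allow a period of 2, but the loop 
through a, b and c with a single delay forces a period of 3. An exhaustive search like 
this grows exponentially with the number of nodes, which is why the formulation is 
handed to an Integer Program solver in practice. 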

As a practical example, where a 5th-order digital filter needs to be mapped onto the 
StarCore processor, a FSFG is generated, with nodes and dependencies defined. Once 
complete, representative expressions and constraints are determined. In this case: 

i = 1, 2, ..., 26, t = 1, 2, ..., 20 

The objective function is given by the expression: 

$$\text{Minimize:}\quad 20\sum_{j=10}^{15} j\,\tau_j + \sum_{t=1}^{20} t\,x_{ot} - \sum_{t=1}^{20} t\,x_{it}$$

where o = output node and i = input node. 

Operation Precedence Constraints are given by the equation: 

$$\sum_{t=1}^{20} t\,x_{i_2 t} - \sum_{t=1}^{20} t\,x_{i_1 t} - d_{i_1} + n_{i_1 i_2}\sum_{j=10}^{15} j\,\tau_j \ge 0$$

Job completion constraints are given by the expression: 

$$\sum_{t=1}^{20} x_{it} = 1 \qquad \text{for all nodes } i = 1, 2, \ldots, 26$$

Iteration period constraints are given by the expression: 

$$\sum_{j=10}^{15} \tau_j = 1$$

FU constraints are given by the expression: 

$$\sum_{i \in N}\sum_{s \in S_n} x_{is} \le 4 + 5\,(1 - \tau_j)$$

for i = 1, 2, ..., 26, n = 0, 1, ..., b_l - 1, S_n = {s | s mod b_l = n} 

0-1 Constraints are given by the expression: 

x_it ∈ {0,1} for all i = 1, 2, ..., 26, and t = 1, 2, ..., 20 

The expressions can be solved with any known, commercially available Integer 
Program solver. One of ordinary skill in the art would appreciate that, with the 
equations given above, equation sets can be derived that act as inputs to commercially 
available IP solvers and that result in outputs which detail a combined schedule and 
map of the algorithm onto the processor architecture. 

The resulting schedule of the 5th-order digital wave filter is shown in Table 2. The 
optimal iteration period is calculated to be 10, with the nodes scheduled as shown in 
Table 2. Time slots T1 through T10 represent the ten periods and the nodes are listed 
thereunder. It should be noted that nodes 24, 25, and 11 from the previous iteration (the 
previous iteration is represented by the -1 superscript notation) are processed at the same 
time as node 2 from the following iteration. The far left hand column represents the 
functional units performing the iterated functions. Based on this, the DSP algorithm can 
readily be programmed. 

The foregoing description of a preferred implementation has been presented by 
way of example only, and should not be read in a limiting sense. Although this invention 
has been described in terms of certain preferred embodiments, namely in terms of two 
specific processor types, other embodiments that are apparent to those of ordinary skill in 
the art, including embodiments which do not provide all of the benefits and features set 
forth herein, are also within the scope of this invention. 